R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut

Transformer-based models have gained popularity in the field of natural language processing (NLP) and are extensively utilized in computer vision tasks and multi-modal models such as GPT4. This paper presents a novel method to enhance the explainability of transformer-based image classification models. Our method aims to improve trust in classification results and empower users to gain a deeper understanding of the model for downstream tasks by providing visualizations of class-specific maps. We introduce two modules: the “Relationship Weighted Out” and the “Cut” modules. The “Relationship Weighted Out” module focuses on extracting class-specific information from intermediate layers, enabling us to highlight relevant features. Additionally, the “Cut” module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color. By integrating these modules, we generate dense class-specific visual explainability maps. We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset. Furthermore, we conduct a large number of experiments on the LRN dataset, which is specifically designed for automatic driving danger alerts, to evaluate the explainability of our method in scenarios with complex backgrounds. The results demonstrate a significant improvement over previous methods. Moreover, we conduct ablation experiments to validate the effectiveness of each module. Through these experiments, we are able to confirm the respective contributions of each module, thus solidifying the overall effectiveness of our proposed approach.


I. INTRODUCTION
EXPLAINABLE machine learning has garnered significant attention in recent years. It refers to the ability of a machine learning model to provide an easily understandable causal relationship that explains the process of model prediction, thereby enhancing human confidence and facilitating model debugging for downstream tasks [1], [2].
Explainability in deep learning models can be categorized into two main types [2]. The first category is intrinsic interpretability, which includes models with relatively simple structures like decision trees [3], logistic regression [4], and linear regression [5]. These models have transparent internal logic structures that can be readily understood during the model design process. However, their accuracy is generally lower compared to mainstream deep learning models. The second category is post-hoc explainability, which involves

Mr. Niu is with the Department of Informatics, Nagoya University, Japan. e-mail: niu.yingjie@g.sp.m.is.nagoya-u.ac.jp

Fig. 1. Overview of our method. Our method can generate a class-specific post-hoc explainability map for different results after the "R-Out" and "Cut" steps.
In the field of computer vision, a large amount of work has focused on increasing the explainability of CNNs by post-hoc visualization of discriminative regions associated with targets in input images.
The emergence of Vision Transformers (ViTs) has revolutionized computer vision. Transformer-based methods, such as Swin-transformer [16] and PVT [15], have surpassed traditional techniques and achieved state-of-the-art (SOTA) performance in various computer vision tasks, including image classification, object detection, and semantic segmentation. Moreover, transformers have played a critical role in advancing multi-modal models such as CLIP [17], ALBEF [18], BLIP [19], and GLIP [20]. Additionally, transformers have been instrumental in the development of large language models (LLMs) [21], which have gained widespread popularity. However, as the application of transformers expands, the need for explainability methods becomes crucial. These methods enhance users' confidence in model results and facilitate the debugging process, ultimately leading to improved performance in downstream tasks. Exploring explainability methods for transformers is a promising avenue to refine and optimize the performance of these models.
arXiv:2307.09050v1 [cs.CV] 18 Jul 2023
Despite these advancements, there are few contributions exploring the explainability of the ViT series of models. Most existing approaches generate explainability maps in ViT directly from the raw-attention map corresponding to the class token in the multi-head self-attention (MHSA) module [22], [23], [24]. However, these methods are class-agnostic, and the generated explainability maps tend to emphasize salient features while containing substantial noise. To address the noise problem associated with explainability methods based on the self-attention map, Abnar et al. proposed a method called attention rollout [25]. Although this approach reduces the noise of raw attention to some extent, it often struggles to distinguish between true foreground and background regions.
Another approach, proposed by Chefer et al., utilizes the Deep Taylor Decomposition principle to assign relevance and mitigate the problem mentioned above [26]. By incorporating the information from back-propagated gradients, this method achieves class-specific explainability. However, the presence of activation functions in the back-propagation process can lead to gradient vanishing and other issues, resulting in sparse and noisy explainability feature maps as outputs.
In our research, we propose a post-hoc visualization explainability method called Relationship Weighted Out and Cut (R-Cut) with the objective of generating dense, low-noise, and class-specific explainability images for visual domain transformers and their derivative models. R-Cut consists of a two-stage extraction method, as illustrated in Fig. 1. In the first stage, we propose a module called "Relationship Weighted Out (R-Out)" to extract the class-specific semantic features from the intermediate vectors. In the second stage, we propose a feature decomposition technique called "Cut" to decompose the class-specific semantic features into fine-grained foreground and background components.
To validate the effectiveness of our method, we conducted qualitative and quantitative experiments on the widely used ImageNet1K dataset [27] and compared it with other SOTA methods. We also conducted experiments on the LRN dataset [28], which we created for automated driving hazard alerts, to test the explainability of our method in the presence of complex backgrounds. Furthermore, we performed ablation experiments to verify the effectiveness of the different modules proposed in our approach, and comparative experiments on various hyperparameters to validate their settings. These experiments aim to show that our method outperforms existing approaches on standard benchmarks and that it can handle complex scenarios.
This paper makes two main contributions: 1) We propose a dense, low-noise, class-specific post-hoc visualization explainability method for transformer-based models and their derivative models. The method achieves SOTA performance on the ImageNet1K dataset. 2) We conducted extensive explainability experiments to validate the effectiveness of the proposed method in autonomous driving scenarios with complex backgrounds. This contribution highlights the practical application of the method in real-world scenarios and demonstrates its ability to provide meaningful explanations even in challenging and intricate environments.

II. RELATED WORK
A. CNN Explainability
In the field of computer vision, specifically for CNNs, a significant amount of research has focused on improving the interpretability of neural network models by generating post-hoc visualizations of discriminative regions related to targets in input images [29], [30], [31], [32], [33], [34], [35], [36]. There are three main groups of post-hoc visualization methods that aim to enhance the explainability of neural network models in computer vision: CAM-based approaches, gradient-based approaches, and perturbation-based approaches.
CAM-based approaches generate visual interpretation maps by linearly weighting the combination of activation maps from the last convolutional layer [29], [30], [32], [33].These approaches often have specific requirements for the network structure, such as the presence of a global pooling layer after the convolutional layer.
Gradient-based approaches [30], [32], [34], [36] identify regions in input images that contribute most to the network's output by backpropagating the gradient of the target category to the input image. However, this approach can suffer from gradient saturation and gradient vanishing issues due to the activation function, leading to noise in the generated gradient map. Additionally, Wang et al. [37] have demonstrated that the gradient map-based approach can be susceptible to a false-confidence issue.
Perturbation-based approaches [38], [39], [40], [41] determine the discriminative regions associated with the target by perturbing the input image and observing the change in confidence in the corresponding prediction. This approach provides more intuitive and easily understandable explainability maps. However, these methods often require the manual design of perturbation maps.

B. ViT Explainability
Currently, there remain few studies focusing on the explainability of methods belonging to the ViT family. Some approaches have been proposed to generate explainability maps directly from the raw-attention map corresponding to the cls-token [22], [23], [24]. These approaches involve recording the self-attention maps generated by the self-attention heads of the last block in the ViT model during inference. The final explainability attention map can be obtained by averaging the attention vectors corresponding to the cls-token in these self-attention maps. This explainability method is class-agnostic, similar to a saliency map, and highlights several objects at the same time, even if they belong to different classes in the input.
However, the main challenge of these methods is the significant differences between the attention vectors of each head, which can introduce noise when taking the mean of the self-attention maps. Abnar et al. [25] proposed a method called attention rollout to solve the problem. They argued that in Transformer-based models, the self-attention results need to be passed through a skip connection. Treating the raw-attention map as the sole source of explainable information would neglect the information processed during the skip connection [42].
Furthermore, relying solely on observing the raw attention output of a single layer may not yield optimal results. Abnar et al. also proposed a linear combination of attentions to address this problem. Although this approach reduces the noise associated with raw attention, it still faces challenges in accurately distinguishing between foreground and background regions.
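The two attention-based baselines described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation of [22]-[25]: it assumes the class token sits at index 0, that attention maps are given as arrays of shape (H, S+1, S+1), and that the skip connection is modelled by mixing each head-averaged attention map equally with the identity.

```python
import numpy as np

def raw_attention_map(last_block_attn: np.ndarray) -> np.ndarray:
    """Average the class-token attention row over the heads of the last block.

    last_block_attn has shape (H, S+1, S+1); token 0 is assumed to be the class token.
    """
    return last_block_attn[:, 0, 1:].mean(axis=0)

def attention_rollout(attn_per_block: list) -> np.ndarray:
    """Multiply head-averaged attention maps across blocks, mixing in the
    identity matrix to account for the skip connection."""
    n = attn_per_block[0].shape[-1]
    rollout = np.eye(n)
    for attn in attn_per_block:
        a = 0.5 * attn.mean(axis=0) + 0.5 * np.eye(n)
        a = a / a.sum(axis=-1, keepdims=True)  # keep every row normalized
        rollout = a @ rollout
    return rollout[0, 1:]  # class-token row, patch columns only

# Toy input: L blocks of random row-stochastic attention (illustrative sizes).
rng = np.random.default_rng(0)
H, S, L = 3, 16, 4
blocks = []
for _ in range(L):
    logits = rng.normal(size=(H, S + 1, S + 1))
    blocks.append(np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True))

raw_map = raw_attention_map(blocks[-1]).reshape(4, 4)
rollout_map = attention_rollout(blocks).reshape(4, 4)
```

Neither function takes a target class as input, which is why both maps are class-agnostic.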
Chefer et al. [26] proposed a novel explainability method that assigns relevance based on the Deep Taylor Decomposition principle. This method uses Layer-wise Relevance Propagation (LRP) [43] to calculate the scores of each attention head related to the class token in each block. Incorporating the back-propagated gradient information makes this a class-specific explainability method. However, due to the existence of activation functions, gradients in the back-propagation process may suffer from issues such as gradient vanishing, resulting in sparse and noisy explainability maps as outputs.

III. METHODS
This section provides an overview of the vision transformer and then introduces our proposed R-Cut method.

A. Vision transformer (ViT)
The ViT model is a popular approach for image classification tasks that uses a transformer-based architecture. Given an input image $X$ with resolution $A \times B$, the network first splits $X$ into several non-overlapping patches. If the size of each patch is $p \times p$, the total number of patches is $S = \frac{A \times B}{p \times p}$. Each patch is then flattened and linearly embedded into a token vector, where $D$ is the dimension of each token vector.
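As a concrete illustration of the patch arithmetic, the following NumPy sketch splits an image into non-overlapping patches and embeds them. The function name `patchify` and the random projection `W_embed` are illustrative assumptions; a real ViT uses learned embedding weights.

```python
import numpy as np

def patchify(image: np.ndarray, p: int) -> np.ndarray:
    """Split an (A, B, C) image into S = (A*B)/(p*p) flattened p-by-p patches."""
    A, B, C = image.shape
    assert A % p == 0 and B % p == 0, "resolution must be divisible by the patch size"
    # (A/p, p, B/p, p, C) -> (A/p, B/p, p, p, C) -> (S, p*p*C), row-major patch order
    patches = image.reshape(A // p, p, B // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

rng = np.random.default_rng(0)
X = rng.random((224, 224, 3))      # A = B = 224
patches = patchify(X, p=16)        # S = (224 * 224) / (16 * 16) = 196 patches
D = 768
W_embed = rng.normal(size=(patches.shape[1], D)) * 0.02  # hypothetical linear embedding
tokens = patches @ W_embed         # one D-dimensional token per patch
```

With a 224×224 input and 16×16 patches this gives S = 196 tokens, each flattened patch holding 16·16·3 = 768 values.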
To enable the network to learn global features, a randomly initialized class token $t^0_{cls} \in \mathbb{R}^{1 \times D}$ is added to the tokens. Finally, the position embeddings are added to each of the tokens to form the input of the transformer block. If there are $L$ cascaded transformer blocks, the input to each transformer block is $t^l \in \mathbb{R}^{(S+1) \times D}$, where $l = 1, \cdots, L$. Each transformer block in ViT consists of layer normalization, an MHSA module, a skip connection, and a multilayer perceptron (MLP) layer. The input and output of each block consist of $(S+1)$ discrete tokens. However, each attention head only processes subspace tokens $t$: if the number of heads in the MHSA is $H$, the dimension of $t$ is $D_h = D/H$ and $t \in \mathbb{R}^{(S+1) \times D_h}$.
The self-attention map $A^l_h$ and the output $O^l_h$ of each head are calculated as follows:
$$A^l_h = \mathrm{softmax}\left(\frac{f_q(t)\, f_k(t)^\top}{\sqrt{D_h}}\right), \qquad O^l_h = A^l_h\, f_v(t),$$
where $f_q$, $f_k$, and $f_v$ are linear transformation layers in the $l$-th block. $A^l_h \in \mathbb{R}^{(S+1) \times (S+1)}$ is the self-attention map of the input tokens from the $h$-th head in the $l$-th block. $O^l_h \in \mathbb{R}^{(S+1) \times D_h}$ is the output of the head. The outputs $O^l_h$ of all heads are concatenated and fed into an MLP block.
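The shapes involved in a single attention head can be checked with a short NumPy sketch. It assumes standard scaled dot-product attention and uses random matrices `Wq`, `Wk`, `Wv` as hypothetical stand-ins for the learned layers $f_q$, $f_k$, $f_v$.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(t, Wq, Wk, Wv):
    """Return the attention map A (S+1, S+1) and head output O (S+1, D_h)."""
    q, k, v = t @ Wq, t @ Wk, t @ Wv
    A = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return A, A @ v

rng = np.random.default_rng(0)
S, D, H = 196, 768, 12
D_h = D // H                                  # 64 dimensions per head
t = rng.normal(size=(S + 1, D_h))             # subspace tokens seen by one head
Wq, Wk, Wv = (rng.normal(size=(D_h, D_h)) * 0.1 for _ in range(3))
A, O = attention_head(t, Wq, Wk, Wv)
```

For ViT-base sizes this yields a 197×197 attention map per head, with each row summing to one.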
From the last transformer block, the output class token $t^L_{cls}$ is used to obtain the category probability vector $\mathrm{ViT}(X)$; if there are $C$ categories, $\mathrm{ViT}(X) \in \mathbb{R}^{1 \times C}$. The vector $\mathrm{ViT}(X)$ is generated as follows:
$$\mathrm{ViT}(X) = \mathrm{MLP}(t^L_{cls}),$$
where MLP denotes the classification head implemented by the MLP block. The corresponding class can be selected by taking the maximum value in the generated vector $\mathrm{ViT}(X)$.

B. Relationship weighted out and Cut
The method consists of two main stages, as depicted in Fig. 2. In the first stage, called "Relationship Weighted Out", the objective is to extract class-aware semantic information about the output results from the discrete intermediate tokens. The second stage, a fine-grained feature decomposition named "Cut", uses the class-specific intermediate vectors obtained in the first stage to construct a novel graph. Subsequently, graph cut operations are performed on the graph to derive foreground information that corresponds to the target. From this foreground information, the method generates a class-specific visual explainability map.
1) Relationship weighted out: In this stage, we extract the class-aware semantic information related to the output results from the discrete patch tokens. Since directly extracting class-aware semantic information from the discrete tokens is challenging, we propose a perturbation map-based approach to obtain the class-aware weight information. This approach consists of two main parts: generating alternative activation maps $M$ and calculating the class-aware weighting scores $w$ to extract class-aware patch tokens $t_c$.
Generating alternative activation maps $M$: As discussed in Section III-A, ViT utilizes discrete tokens to convey information. The intermediate discrete tokens involved in the forward transmission process carry semantic information of the corresponding category, as the network propagates category information during forward propagation. However, within each transformer block there are multiple intermediate tokens. To address the interference caused by the skip connection, we select the output of the normalization layer after the skip connection in the last block to extract semantic information. We first generate the patch tokens $t^L_S$ by removing the last-layer class token $t^L_{cls}$ from the output of the last layer normalization $t^L \in \mathbb{R}^{(S+1) \times D}$. Then the alternative activation maps $M$ are generated from the patch tokens $t^L_S$ as follows:
$$M = \mathrm{reshape}(t^L_S),$$
where $\mathrm{reshape}(\cdot)$ denotes the deserialization operation that regroups the discrete patch tokens into a matrix map format, $M \in \mathbb{R}^{(\frac{A}{p} \times \frac{B}{p}) \times D}$.

Fig. 2. Overall architecture of our method. First, we extract $t^L_S$ from ViT. Next, we use our "R-Out" module to extract the class-aware tokens $t_c$. We then employ the "Cut" module for fine-grained feature decomposition. By combining these modules, we obtain class-specific explainability maps.

Generating perturbation maps $P$: In this method, we consider $M$ as $D$ heat maps and perturb the original input image $X$ through those heat maps to obtain perturbation maps $P \in \mathbb{R}^{(A \times B \times 3) \times D}$. The formula is as follows:
$$P_i = X \odot \mathrm{up}(M_i), \quad i = 1, \cdots, D,$$
where $\odot$ denotes element-wise multiplication and $\mathrm{up}(\cdot)$ stands for up-sampling with a scale factor of $p$.
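The two steps above (reshaping the patch tokens into $D$ heat maps, then masking the input with each up-sampled heat map) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: it assumes row-major patch order, nearest-neighbour up-sampling, and a min-max normalization of each heat map to [0, 1] before masking, none of which is specified in the text. Small sizes are used to keep the example light.

```python
import numpy as np

def activation_maps(t_LS: np.ndarray, grid_hw: tuple) -> np.ndarray:
    """Regroup (S, D) patch tokens into D heat maps of shape (A/p, B/p, D)."""
    h, w = grid_hw
    return t_LS.reshape(h, w, -1)

def upsample(m: np.ndarray, p: int) -> np.ndarray:
    """Nearest-neighbour up-sampling by a factor p along both spatial axes."""
    return np.repeat(np.repeat(m, p, axis=0), p, axis=1)

def perturbation_maps(X: np.ndarray, M: np.ndarray, p: int) -> np.ndarray:
    """Mask the image X (A, B, 3) with each heat map; returns (D, A, B, 3)."""
    # Assumption: each heat map is min-max normalized so it acts as a soft mask.
    lo = M.min(axis=(0, 1), keepdims=True)
    hi = M.max(axis=(0, 1), keepdims=True)
    masks = upsample((M - lo) / (hi - lo + 1e-8), p)      # (A, B, D)
    return masks.transpose(2, 0, 1)[..., None] * X[None]  # broadcast over RGB

rng = np.random.default_rng(0)
A, B, p, D = 32, 32, 8, 6                      # illustrative sizes, not ViT-base
X = rng.random((A, B, 3))
t_LS = rng.normal(size=((A // p) * (B // p), D))
M = activation_maps(t_LS, (A // p, B // p))
P = perturbation_maps(X, M, p)
```

Each `P[i]` is a copy of the input image in which the regions that dimension `i` of the tokens responds to are kept and the rest is attenuated.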
Calculating the class-aware weighting scores $w$: To compute the weight score $w_i$ for each perturbation map $P_i$, we input both the perturbation maps $P$ and the original image $X$ into the pre-trained ViT model. We then use the similarity between the output vectors to compute the weight score for each perturbation map. A higher similarity between the output vectors indicates a stronger contribution of the corresponding perturbation map to the target class, which is calculated as follows: where $w$ is a row vector of size $D$, $D$ is the number of perturbation maps, $\mathrm{ViT}(\cdot)$ denotes the output vector of the ViT model, and $C$ represents the length of the output vector.
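The weighting step can be illustrated with the sketch below. The exact similarity measure is not recoverable from the text, so cosine similarity between the two output vectors is used here purely as an assumption. `toy_model` is a hypothetical stand-in for the pre-trained ViT that maps a batch of images to probability vectors of length $C$.

```python
import numpy as np

def toy_model(images: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the pre-trained ViT: (N, A, B, 3) -> (N, C)."""
    logits = images.reshape(images.shape[0], -1) @ weights
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def class_aware_weights(model, X: np.ndarray, P: np.ndarray) -> np.ndarray:
    """One score per perturbation map: similarity of model(P_i) to model(X).

    Assumption: cosine similarity; the source only says "similarity".
    """
    ref = model(X[None])[0]                     # (C,) output for the original image
    out = model(P)                              # (D, C) outputs for the perturbed images
    num = out @ ref
    den = np.linalg.norm(out, axis=1) * np.linalg.norm(ref) + 1e-12
    return num / den                            # (D,)

rng = np.random.default_rng(0)
A = B = 32
C, D = 5, 6
X = rng.random((A, B, 3))
P = X[None] * rng.random((D, A, B, 1))          # toy perturbation maps
W = rng.normal(size=(A * B * 3, C)) * 0.01
model = lambda imgs: toy_model(imgs, W)
w = class_aware_weights(model, X, P)
```

A perturbation map that leaves the prediction unchanged receives a score close to 1, while one that removes the evidence for the predicted class receives a lower score.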
Extracting class-aware patch tokens $t_c$: Since the perturbation maps $P$ are generated based on the original patch tokens $t^L_S$, the weight of each dimension of $P$ regarding the original output result is equivalent to the weight of each dimension of the patch tokens $t^L_S$ regarding the original output result. Therefore, we can extract $t_c \in \mathbb{R}^{S \times D}$ using the following formula:
$$t_c = t^L_S \odot w,$$
where $w$ is broadcast over the $S$ patch tokens so that each of the $D$ dimensions is scaled by its weight score.
2) Fine-grained feature decomposition: In this section, we discuss how to finely partition the foreground and background information related to the category from the discrete tokens $t_c$ obtained in Section III-B1. In our previous research [28], we experimented with a simple method of summing all the dimensions of $t_c$ and reshaping the result to obtain the explainability feature map. Even this simple method produces good results. However, it does not consider the spatial position relationship of the discrete patch tokens, and it may not effectively address local discontinuities in the generated explainability map. To overcome these limitations and achieve more precise foreground-background partitioning, we propose a new method based on the graph cut technique discussed in Appendix B.
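The token weighting and the simple sum-and-reshape baseline from [28] can be sketched as follows. The function names are illustrative, the weights are random placeholders, and the per-dimension scaling is an assumption consistent with $w$ having one entry per token dimension.

```python
import numpy as np

def class_aware_tokens(t_LS: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Scale each of the D token dimensions by its weight score; returns (S, D)."""
    return t_LS * w[None, :]

def simple_map(t_c: np.ndarray, grid_hw: tuple) -> np.ndarray:
    """Baseline: sum over the D dimensions and reshape to the patch grid."""
    h, w_ = grid_hw
    return t_c.sum(axis=1).reshape(h, w_)

rng = np.random.default_rng(0)
S, D = 196, 768
t_LS = rng.normal(size=(S, D))     # placeholder patch tokens
w = rng.random(D)                  # placeholder weight scores
t_c = class_aware_tokens(t_LS, w)
baseline = simple_map(t_c, (14, 14))
```

Because every patch is scored independently here, neighbouring patches can receive very different values, which is the local discontinuity the graph-based step is meant to address.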
First, we generate a class-aware weighted graph $G = (V, e)$ using the class-aware patch tokens $t_c$. This graph considers both the direct relationship between nodes and the positional embedding relationship between the patch tokens. Next, we perform graph cut operations on this weighted graph to decompose it and obtain the corresponding class-specific eigenvector $y_1$. By leveraging the class-specific eigenvector $y_1$, we can identify the foreground vector $y^c_1$ associated with the target class.
Constructing a class-aware weighted graph $G$: We generate the corresponding graph based on the class-aware patch tokens $t_c$. Specifically, we select the $S$ class-aware patch tokens as the $S$ nodes in the graph, resulting in $V$. Next, we define the edge $e_{ij}$ between two tokens $V_i$ and $V_j$ as the cosine similarity between them, incorporating both semantic and spatial information. By computing these similarities, we obtain $e$. The formula for calculating the edge weights is as follows: where $\phi$ is a settable hyperparameter representing a constraint on the edges. We consider two nodes to be related only if the similarity between them exceeds $\phi$.
Getting the eigenvector $y_1$: To obtain the eigenvector $y_1$, we apply the normalized cut (Ncut) method described in Appendix B to partition the class-aware weighted graph $G$. This involves solving the generalized eigensystem $(K - e)y = \lambda K y$ of $G$ and extracting the eigenvector $y_1 \in \mathbb{R}^{1 \times S}$ corresponding to the second smallest eigenvalue. Appendix B provides a proof that the eigenvector $y_1$ is the Ncut solution of the class-aware graph $G$, which is the class-aware vector we need corresponding to the target class.
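A minimal sketch of the graph construction and the eigenvector step is given below. Several details are assumptions rather than the paper's definitions: $K$ is taken to be the diagonal degree matrix, as in the standard Ncut formulation; edges whose cosine similarity does not exceed $\phi$ are set to a small constant `eps`; and no positional information is added to the tokens. The generalized problem $(K - e)y = \lambda K y$ is solved through the equivalent symmetric problem on $K^{-1/2}(K - e)K^{-1/2}$.

```python
import numpy as np

def build_graph(t_c: np.ndarray, phi: float = 0.05, eps: float = 1e-5) -> np.ndarray:
    """Edge matrix e from cosine similarity between tokens, thresholded at phi.

    Assumption: similarities not exceeding phi are replaced by eps.
    """
    unit = t_c / (np.linalg.norm(t_c, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    return np.where(sim > phi, sim, eps)

def second_smallest_eigenvector(e: np.ndarray) -> np.ndarray:
    """Solve (K - e) y = lambda K y and return the eigenvector of the second smallest eigenvalue."""
    k = e.sum(axis=1)                           # degrees, the diagonal of K (assumed)
    k_inv_sqrt = 1.0 / np.sqrt(k)
    # With y = K^{-1/2} z, the problem becomes (I - K^{-1/2} e K^{-1/2}) z = lambda z.
    sym = np.eye(len(k)) - k_inv_sqrt[:, None] * e * k_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(sym)               # eigenvalues in ascending order
    return k_inv_sqrt * vecs[:, 1]

# Toy tokens: two well-separated groups of 8 tokens each.
rng = np.random.default_rng(0)
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
t_c = np.vstack([np.tile(a, (8, 1)), np.tile(b, (8, 1))])
t_c += rng.normal(scale=1e-3, size=t_c.shape)

e = build_graph(t_c)
y1 = second_smallest_eigenvector(e)
mask = y1 > y1.mean()                           # bipartition at the mean value
```

The sign of an eigenvector is arbitrary, so this sketch only yields a two-way partition; deciding which side is the foreground requires an additional rule that is not shown here.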
The goal is to generate the explainability visualization map $L_{R\text{-}Cut}$ by partitioning the class-specific foreground and background information. To achieve this, we determine the splitting point by taking the mean value $\bar{y}_1 =$

IV. EXPERIMENTS
A. Experiment setting
To verify the effectiveness of our class-specific post-hoc visualization explainability method, we conducted three kinds of evaluation experiments (the point game [44], weakly supervised localization, and the perturbation test) against four SOTA explainability methods on ImageNet1K [27]: raw-attention [22], [23], [24], rollout [25], grad-cam [30], and Hila's method [26]. These methods fall into three categories: raw-attention and rollout are attention-based, grad-cam is gradient-based, and Hila's method combines attention-based and gradient-based approaches. We also performed three kinds of ablation experiments to verify the effectiveness of the different modules proposed in our method. To further validate the applicability of our approach in real-world complex scenarios, we also tested our method on the LRN dataset, which focuses on autonomous driving risk warning [28]. Lastly, we performed multiple sets of hyperparameter comparison experiments to check that the hyperparameters used throughout our experiments are reasonable.
1) Datasets: We evaluated the proposed method (R-Cut) on the ImageNet1k [27] and LRN [28] datasets to verify its accuracy and effectiveness in generating explainability maps. Each of the two datasets poses different challenges for explainability maps.
ImageNet1k contains 1000 image categories, with 1.28 million images for training and 50,000 images for validation. The 1000 object categories in ImageNet1k include common object classes found in daily life, as well as fine-grained categories with small inter-class differences, such as numerous bird families and canines. The validation set contains many single-class multi-object images, which can cause missed detections in the generated explainability map. The biggest challenge for the fine-grained classes is the tendency of explainability maps to focus on discriminative regions due to the small inter-class differences. For example, in the case of birds like snowbirds and bulbuls, which differ mainly in the shape of their beaks, the explainability maps tend to cluster around the beak area.
The LRN dataset is a linguistic warning dataset we created for risk scenes in autonomous driving scenarios [28]. This dataset contains a total of 34,488 images and 10 linguistic cue categories. Each risk cue category consists of the type of risk object (car, cyclist, or pedestrian) and the general orientation information (ahead, ahead right, or ahead left), e.g., "watch out for the pedestrian ahead right". Therefore, even the same risk object in this dataset can belong to a different category depending on its location. The main challenges of this dataset are the complexity of the road scenarios and the influence of location information on the explainability maps.
2) Implementation Details: In our experiments, we used the same pre-trained ViT-base model as the backbone for all explainability map tests to ensure fairness. The following hyperparameters were selected: the input $X$ is a 3-channel 224 × 224 RGB image, the patch size of the patch embedding is 16 × 16, the number of heads in the MHSA layer is 12, and the number of transformer blocks is also 12. We set the similarity threshold $\phi$ to 0.05 when constructing the graph. All our experiments are trained and tested on an RTX A6000 GPU with a batch size of 256 and 200 training epochs.

B. Evaluation metrics
For the quantitative experiments, we employed three commonly used evaluation metrics to assess the quality of explainability: Point game, IoU (Intersection over Union), and Perturbation test.
1) The Point game test: As described in [44], this method evaluates the correctness of the explainability map by checking whether the highest pixel value in the generated explainability image falls within the ground truth (GT) bounding box of the target object. If the highest pixel value is located within the GT bounding box, the network's explainability map is considered to correctly explain the object category.
The formula for this metric can be expressed as:
$$\mathrm{PointGame} = \frac{1}{N} \sum_{i=1}^{N} [f(x_i) = y_i] \cdot [\arg\max_j M_{ij} \in GT_i],$$
where $N$ represents the total number of samples, $x_i$ refers to the input image of the $i$-th sample, $y_i$ denotes the ground truth label of the target category, $f$ is the trained classification model, $M_{ij}$ represents the pixel value at position $j$ in the generated explainability image, and $GT_i$ is the ground truth bounding box for the target category $y_i$.
The indicator function $[f(x_i) = y_i]$ is equal to 1 when the label predicted by the model $f$ is the same as the true label $y_i$; otherwise it is equal to 0. Therefore, this metric is a weighted average of classification accuracy and explainability, where the weight of explainability is determined by the highest pixel value $M_{ij}$.
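The point game check can be sketched as below. This is an illustrative reading of the description above, not the evaluation code of [44]: boxes are assumed to be `(x_min, y_min, x_max, y_max)` in inclusive pixel coordinates, and a sample counts as a hit only if the prediction is correct and the maximum of the map lies inside the box.

```python
import numpy as np

def point_game_hit(expl_map: np.ndarray, box: tuple) -> bool:
    """True if the highest-valued pixel of the map lies inside the GT box."""
    row, col = np.unravel_index(np.argmax(expl_map), expl_map.shape)
    x0, y0, x1, y1 = box                     # assumed inclusive pixel coordinates
    return bool(x0 <= col <= x1 and y0 <= row <= y1)

def point_game_score(maps, boxes, preds, labels) -> float:
    """Fraction of samples that are correctly classified and hit the GT box."""
    hits = [
        (pred == label) and point_game_hit(m, b)
        for m, b, pred, label in zip(maps, boxes, preds, labels)
    ]
    return float(np.mean(hits))

# Three toy samples: a hit, a miss, and a hit with a wrong prediction.
inside = np.zeros((10, 10))
inside[5, 7] = 1.0                           # maximum at row 5, column 7
outside = np.zeros((10, 10))
outside[0, 0] = 1.0
box = (6, 4, 8, 6)
score = point_game_score(
    maps=[inside, outside, inside],
    boxes=[box, box, box],
    preds=[3, 3, 1],
    labels=[3, 3, 3],
)
```

Only the first sample satisfies both conditions, so the score for this toy batch is one third.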
2) The IoU test: For the weakly supervised localization IoU experiment, we followed the procedure of [45]. First, the generated explainability feature map was upsampled to match the size of the original image. Next, we set the threshold thres = 0.2 to discard some background regions. Subsequently, the remaining region within the explainability map was used to generate the predicted bounding box $A$ by enclosing it with the minimum enclosing rectangle. Lastly, we employed Intersection over Union (IoU) as the evaluation metric to assess the quality of object-level localization achieved by the explainability feature map.
The formula for this metric can be expressed as:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|},$$
where $B$ is the GT bounding box.
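The thresholding, box extraction, and IoU computation can be sketched as follows. The min-max normalization before thresholding and the inclusive `(x_min, y_min, x_max, y_max)` box convention are assumptions made for the example; the map is assumed to be already upsampled and not constant.

```python
import numpy as np

def map_to_box(expl_map: np.ndarray, thres: float = 0.2) -> tuple:
    """Threshold the normalized map and enclose what remains in a rectangle."""
    lo, hi = expl_map.min(), expl_map.max()
    norm = (expl_map - lo) / (hi - lo + 1e-8)   # assumption: scale to [0, 1] first
    rows, cols = np.nonzero(norm > thres)
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())

def box_iou(a: tuple, b: tuple) -> float:
    """IoU of two boxes given as inclusive (x_min, y_min, x_max, y_max)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)

expl_map = np.zeros((10, 10))
expl_map[2:6, 3:8] = 1.0                        # rows 2..5, columns 3..7
pred_box = map_to_box(expl_map)                 # predicted box A
gt_box = (3, 2, 7, 9)                           # GT box B, twice as tall
iou = box_iou(pred_box, gt_box)
```

In this toy case the predicted box covers the upper half of the ground-truth box, giving an IoU of 0.5.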
3) The perturbation test: This test consists of two experiments, Most Relevant First Perturbation (MRFP) and Least Relevant First Perturbation (LRFP), as described in Hila's work [46].
In MRFP, we begin by masking the most relevant pixels of the explainability map to generate the corresponding perturbation map. We then input the perturbation map into the trained model and observe the change in the confidence of the corresponding target. A larger confidence change indicates better performance.
In LRFP, we mask the least relevant pixels of the explainability map first. Here the change in confidence should be as small as possible, because in theory the removed part does not belong to the target.
Throughout our experiments, we incrementally increase the proportion of masked pixels from 10% to 90%.We calculate the mean value of the confidence change as the actual confidence change value.
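The masking schedule of the perturbation test can be sketched as below. This is a minimal sketch under stated assumptions: masked pixels are set to zero, the confidence change is measured as the drop relative to the unmasked image, and `toy_model` is a hypothetical stand-in for the trained classifier that responds only to the top-left quadrant of the image.

```python
import numpy as np

def mask_pixels(image, expl_map, fraction, most_relevant_first=True):
    """Zero out a fraction of pixels, ordered by relevance in expl_map."""
    order = np.argsort(expl_map.ravel())        # ascending relevance
    if most_relevant_first:
        order = order[::-1]
    k = int(round(fraction * order.size))
    mask = np.ones(expl_map.size)
    mask[order[:k]] = 0.0                       # assumption: masked pixels become 0
    return image * mask.reshape(expl_map.shape)[..., None]

def mean_confidence_change(model, image, expl_map, target, most_relevant_first):
    """Average confidence drop over masking ratios of 10% to 90%."""
    base = model(image)[target]
    drops = []
    for frac in [i / 10 for i in range(1, 10)]:
        perturbed = mask_pixels(image, expl_map, frac, most_relevant_first)
        drops.append(base - model(perturbed)[target])
    return float(np.mean(drops))

# Hypothetical classifier: confidence for class 0 is the mean of the top-left quadrant.
toy_model = lambda img: np.array([img[:4, :4].mean(), 0.0])

rng = np.random.default_rng(0)
image = np.ones((8, 8, 3))
expl_map = rng.random((8, 8))
expl_map[:4, :4] += 10.0                        # explanation marks the true evidence
mrfp = mean_confidence_change(toy_model, image, expl_map, 0, most_relevant_first=True)
lrfp = mean_confidence_change(toy_model, image, expl_map, 0, most_relevant_first=False)
```

For an explanation that matches the evidence the model actually uses, the MRFP value is large and the LRFP value is small, which is the pattern the test rewards.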

C. Experiment results
1) Performance in ImageNet1K: This section presents qualitative and quantitative analyses on the ImageNet1K dataset. For our qualitative analysis, we conducted post-hoc explainability visualization experiments on single-class single-object images, single-class multi-object images, multi-class single-object images, and multi-class multi-object images, respectively. For our quantitative analysis, we employed three different tests: the point game, IoU, and the perturbation test.
Fig. 3 presents the performance of our R-Cut method and other methods on the ImageNet1K dataset for single-class single-object images, single-class multi-object images, and fine-grained images (the bird family) with small inter-class differences. The explainability visualization experiments were conducted separately for regularly shaped and irregularly shaped objects in order to ensure fairness.
As shown in Fig. 3, the raw-attention and rollout methods exhibit more background noise, while the grad-cam method accurately locates the object but only highlights the discriminative regions. Hila's method is relatively effective in activating the corresponding regions but still exhibits local discontinuities in the explainability map. In contrast, our R-Cut method eliminates the background noise and mitigates the discriminative regions problem in fine-grained categories (d) and (e). Moreover, our method accurately identifies all objects in single-class multi-object images (c) and (f). To demonstrate that our method is class-specific, we conducted a comparative explainability visualization analysis on multi-class images, such as the classic "dog and cat" and "elephant and zebra". The purpose is to show different corresponding explainability visualizations for different object categories within the same image.
As shown in Fig. 4, the raw-attention method and rollout method are class-agnostic methods, while the grad-cam method and Hila's method can visualize different classes of objects, but suffer from background noise interference and local discontinuity problems. In contrast, our method can not only visualize the explainability maps of different classes but also generate regions of explainability maps that can effectively mask objects. Our R-Cut method can also visualize and explain multi-class multi-object images clearly.
Point game test results: Table I shows the results of the point game localization experiments on the ImageNet1K dataset with explainability maps. Our method outperforms the SOTA method by 2.36% on the ImageNet1K dataset when utilizing GT categories. Additionally, without the knowledge of GT categories, our method still achieves an improvement of 1.61% compared to the previous SOTA method. These results show that our method localizes objects within the ImageNet1K dataset more accurately.
IoU test results: Table II presents the results of the pixel-level explainability localization IoU experiments. Our method demonstrates an improvement of 4.5% (with GT) and 4.09% (without GT) on the ImageNet1K dataset when compared to the previous method by Hila. These results validate the enhanced completeness and explainability of our method in localizing object pixels.
Perturbation test results: The two metrics above are human-defined. To check that an explanation reflects the actual regions the model is using, we also conducted a perturbation test. For MRFP, where we mask the most relevant region related to the model's prediction, we expect a high confidence change in the model's prediction for the corresponding category. Our method demonstrates an improvement of 3.6% compared to Hila's SOTA method. For LRFP, the masked-out region should be irrelevant to the model prediction, so the impact on confidence should be as small as possible. Our method's LRFP result is 15.69%, a reduction of 1.22% compared to Hila's method. Both qualitative and quantitative results show that our explainability visualization method outperforms the previous SOTA method on the ImageNet1K dataset.
2) Performance in LRN dataset: To verify the effectiveness of our method in complex scenarios, we also performed qualitative and quantitative analysis on the hazard warning dataset LRN [28] for autonomous driving scenarios.Fig. 5 shows the explainability visualization results of our R-Cut method and other methods on the LRN dataset.We visually post-hoc explained each of the three risk categories: dangerous vehicle, dangerous cyclist, and dangerous pedestrian.The visualizations clearly demonstrate that our method can visually explain the situation accurately even in traffic scenes with complex backgrounds.Point game test results: Table IV shows the results of our method and other SOTA methods in point game localization experiments on LRN dataset with the generated explainability maps.Our method outperforms the previous SOTA method with significant improvements.Specifically, our method achieves a remarkable improvement of 21.44% without GT and 21.67% with GT compared to the previous SOTA method.These results demonstrate the superior objectlevel explainability localization performance of our method in driving scenes.
IoU test results: Table V shows the results of the pixel-level explainable localization IoU experiments, in which our method and other baselines were evaluated on the LRN dataset. Our method achieved a notable improvement of 5.34% without the GT category and 5.56% with the GT category compared to Hila's method. These results demonstrate that our method more completely explains the pixels that belong to the risk object.
Perturbation test results: In the MRFP test, we observe the impact of the perturbation on the model's output confidence, and we expect this impact to be large. As shown in Table VI, our method outperformed Hila's method by 5.73% in this test. In the LRFP test, our method outperformed Hila's method with a reduction of 1.62%.
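The pixel-level IoU reported in Table V can be sketched as follows. This is a minimal illustration, assuming the explainability map is normalized to [0, 1] and binarized at a fixed threshold. The threshold value of 0.5 and the function name are assumptions, not values taken from the paper.

```python
import numpy as np

def explanation_iou(heatmap, gt_mask, threshold=0.5):
    """IoU between a thresholded explainability map and a ground truth mask.

    heatmap: 2-D array with values in [0, 1] (assumed normalization).
    gt_mask: 2-D array, non-zero where the object is.
    threshold: binarization threshold (assumed value).
    """
    pred = np.asarray(heatmap) >= threshold
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0 for this sketch
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: the prediction covers the top row, the object the left column.
demo_iou = explanation_iou(
    np.array([[0.9, 0.8], [0.1, 0.2]]),
    np.array([[1, 0], [1, 0]]),
)
```

Here the two masks share one pixel and together cover three, so the IoU is one third.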
3) Ablation test: To validate the efficacy of our two proposed modules, we conducted qualitative and quantitative experiments on three method variants: (1) only Relationship weighted out, (2) only Cut, and (3) R-Cut. As shown in Fig. 6, the Relationship weighted out method includes a class-aware function, but it does not consider spatial location relationships, which leads to local discontinuities. For example, the chest position of the dog is not activated in the R-Out column in Fig. 6(a). On the other hand, the Cut method generates locally dense explainability maps by considering location, texture, and color information during the graph decomposition process, but its map remains class-agnostic. Moreover, since color information is considered in the computation, the Cut method treats the brown desktop and the black drawer in Fig. 6(b) as not belonging to the same entity. In contrast, the R-Cut method generates explainability maps that are both class-aware and dense.
Table VII shows the performance of the three method variants in the point game, IoU, and perturbation test experiments; the R-Cut method achieves the best results. The experimental results demonstrate that only R-Cut can generate a fine-grained class-specific explainability map.
Furthermore, we present the localization results of our method in the point game test for different values of the hyperparameter ϕ to justify our chosen value. As shown in Table VIII, our method achieves the best performance when ϕ = 0.05.

V. CONCLUSION
This paper introduces a novel post-hoc visualization explainability method for Transformer-based image classification tasks. Our method addresses the crucial need for trust and understanding in classification results. Through our proposed "Relationship weighted out" module, we obtain class-specific information from intermediate layers, enhancing the class-aware explainability of the discrete tokens. Additionally, our "Cut" module enables fine-grained feature decomposition. By combining the two modules, we generate dense class-specific visual explainability maps.
We extensively evaluated our method on the ImageNet dataset, conducting both qualitative and quantitative analyses. Furthermore, we tested the explainability of our method in complex backgrounds by performing numerous experiments on the LRN dataset for automatic driving danger alerts. The results of both sets of experiments demonstrate a significant improvement of our method over previous SOTA approaches. Additionally, ablation experiments provide further validation of the effectiveness of the individual modules proposed in our method.
Overall, our method not only enhances trust in Transformer-based image classification but also contributes to the comprehension of the model, benefiting downstream tasks. In the future, we plan to extend our work to explainability experiments on multi-modal tasks.

APPENDIX A ERROR ANALYSIS
To further investigate the limitations of our R-Cut method, we examined the results of all incorrect explainable estimates and summarized the reasons that led to inaccurate output explainability maps as follows.
Reason 1: The ImageNet1K dataset contains many hard-to-predict samples, resulting in deviations between the model predictions and the ground truth class. Our method does not work well when the model itself predicts incorrectly. To verify this conjecture, we removed the test samples for which the model itself predicted incorrectly and reran the point game and IoU tests. Our method then achieved 61.01% mIoU in the IoU test and 81.25% in the point game test, which are improvements of 2.22% and 1.16% over the previous results, respectively.
Reason 2: Some ImageNet1K test samples contain multiple classes, although ImageNet1K itself is a single-target classification dataset. This leads to incomplete prediction results, and the generated explainability map contains only one class. As shown in Fig. 7, in image (a) the ground truth bounding box corresponds to an "instrument", but our model localizes a "dog". "Dog" is also a class in ImageNet1K, but the ground truth of this image carries no multi-class labels. Similarly, image (b) is a multi-category image with only a single class label.

APPENDIX B GRAPH CUT
The Ncut algorithm is a typical graph cut method that has been widely used in various fields, including computer vision, pattern recognition, and image processing, due to its effectiveness and efficiency. It was first introduced by Shi et al. in 1997 [47]. In traditional image segmentation, the algorithm represents an image as a graph, where each pixel block is considered a node in the graph. The correlation between pixel values is used to generate a weighted graph V. Based on the weighted graph, the algorithm actively partitions the image into two disjoint regions, I and J, which exhibit similar features such as texture or color.
The Ncut algorithm defines the cut cost as a fraction of the total edge connections to all the nodes in the graph. The optimal segmentation is achieved by minimizing the following equation:
$$Ncut(I, J) = \frac{cut(I, J)}{sim(I, V)} + \frac{cut(I, J)}{sim(J, V)},$$
where $cut(I, J)$ is defined as the sum of the edge weights between nodes in $I$ and nodes in $J$, i.e., $cut(I, J) = \sum_{u \in I, f \in J} w(u, f)$. Similarly, $sim(I, V)$ and $sim(J, V)$ are defined as the sum of the edge weights between nodes in $I$ and $V$ and between nodes in $J$ and $V$, respectively.
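To make the definition concrete, the following sketch evaluates the Ncut cost of a given two-way partition directly from a weight matrix. The function name and the toy graph are hypothetical and only serve as an illustration; the sketch assumes both regions are non-empty and connected to the graph, so that neither denominator is zero.

```python
import numpy as np

def ncut_cost(w, in_i):
    """Ncut cost of the partition (I, J) of a weighted graph.

    w: symmetric S x S matrix of non-negative edge weights.
    in_i: boolean vector of length S, True for nodes in I, False for J.
    """
    w = np.asarray(w, dtype=float)
    in_i = np.asarray(in_i, dtype=bool)
    in_j = ~in_i
    cut = w[np.ix_(in_i, in_j)].sum()   # edge weights crossing I -> J
    sim_i = w[in_i, :].sum()            # edge weights from I to all nodes
    sim_j = w[in_j, :].sum()            # edge weights from J to all nodes
    return cut / sim_i + cut / sim_j

# Toy graph: nodes {0, 1} and {2, 3} are tightly linked, with weak cross links.
demo_w = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])
good_cost = ncut_cost(demo_w, [True, True, False, False])
bad_cost = ncut_cost(demo_w, [True, False, True, False])
```

On this toy graph the partition that separates the two tightly linked pairs has a much lower cost than the partition that splits each pair.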
By minimizing the Ncut equation, the algorithm minimizes the cut cost between the two regions relative to each region's total connection to the graph. This ensures that the resulting segmentation has low inter-cluster similarity and high intra-cluster similarity.
Jianbo Shi et al. [47] showed that, by setting $y = (1 + x) - b(1 - x)$ under the condition $y^T K \mathbf{1} = 0$, the minimum value of $Ncut(X)$ is obtained by minimizing Eq. (11), where $K$ is a diagonal matrix of size $S \times S$ with $k(i) = \sum_j w(i, j)$, the sum of the weights between the $i$-th token and the other tokens, and $e$ is an $S \times S$ symmetric matrix of the weights between tokens, with $e(i, j) = w(i, j)$.
By minimizing the above equation, we can obtain the optimal partition of the graph into two disjoint regions with the same features, as required by the Ncut algorithm.
By setting $Z = K^{\frac{1}{2}} y$, Eq. (11) becomes
$$\min_{Z} \frac{Z^T K^{-\frac{1}{2}} (K - e) K^{-\frac{1}{2}} Z}{Z^T Z}. \quad (12)$$
According to the Ncut article [47], Eq. (12) is a Rayleigh quotient [48], and when the constraint on $y$ is relaxed to real values, minimizing it is equivalent to solving a standard eigensystem: $K^{-\frac{1}{2}} (K - e) K^{-\frac{1}{2}} Z = \lambda Z$. It is easy to prove that $Z_0 = K^{\frac{1}{2}} \mathbf{1}$ is an eigenvector [49] with the minimum eigenvalue $\lambda = 0$. Since $(K - e)$ is a positive semidefinite [50] Laplacian matrix, the eigenvector $Z_1$ with the second smallest eigenvalue is perpendicular to $Z_0$. Based on this relation we obtain $Z_1^T Z_0 = 0$, and with $y = K^{-\frac{1}{2}} Z$ we get $y_1^T K \mathbf{1} = 0$. Therefore, the eigenvector with the second smallest eigenvalue of the generalized eigensystem $(K - e) y = \lambda K y$ is the real-valued solution to the Ncut problem.
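The eigenvector computation described above can be sketched numerically. The code below builds $K^{-1/2}(K - e)K^{-1/2}$, takes the eigenvector with the second smallest eigenvalue, and maps it back with $y = K^{-1/2} Z$. It is a minimal sketch, assuming every token has a positive weight sum so that $K$ is invertible; the function name and toy weights are hypothetical.

```python
import numpy as np

def ncut_second_eigenvector(e):
    """Real-valued Ncut solution for a symmetric weight matrix e.

    Returns (eigvals, y1): eigvals in ascending order, and y1, the
    eigenvector of (K - e) y = lambda K y with the second smallest
    eigenvalue.
    """
    e = np.asarray(e, dtype=float)
    k = e.sum(axis=1)                   # diagonal entries k(i) of K
    k_inv_sqrt = 1.0 / np.sqrt(k)       # assumes all k(i) > 0
    lap = np.diag(k) - e                # Laplacian K - e
    # K^{-1/2} (K - e) K^{-1/2}, using that K is diagonal
    norm_lap = k_inv_sqrt[:, None] * lap * k_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(norm_lap)  # ascending eigenvalues
    z1 = eigvecs[:, 1]                  # second smallest eigenvalue
    y1 = k_inv_sqrt * z1                # y = K^{-1/2} Z
    return eigvals, y1

# Toy graph: tokens {0, 1} and {2, 3} form two tightly linked groups.
demo_e = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])
demo_eigvals, demo_y1 = ncut_second_eigenvector(demo_e)
```

On this toy graph the entries of `demo_y1` have one sign for tokens 0 and 1 and the opposite sign for tokens 2 and 3, so thresholding the vector recovers the two groups. The overall sign of an eigenvector is arbitrary, so only the relative signs are meaningful.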

$\bar{y}_1 = \frac{1}{S} \sum_{i=1}^{S} y_1^i$ of the continuous eigenvector $y_1$. Then we define the foreground set as $f = \{node_i \mid y_1^i \geq \bar{y}_1\}$ and the background set as $b = \{node_i \mid y_1^i < \bar{y}_1\}$. To eliminate the interference brought by the background information, we set all nodes in the background set to 0. The class-specific vector $y_1^c$ is obtained by keeping the information of the foreground set unchanged. Finally, we can obtain our class-specific explainability visualization map $L_{R\text{-}Cut}$ as follows:
$$L_{R\text{-}Cut} = 0.5 \cdot 255 \cdot up(reshape(y_1^c)) + 0.5 \cdot X$$
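The thresholding and overlay steps above can be sketched as follows. This is an illustration under stated assumptions: the token scores are taken to lie in [0, 1], `up` is implemented as nearest-neighbour upsampling (the interpolation used in the paper is not specified here), and the image height and width are assumed to be integer multiples of the token grid size. The function name `rcut_overlay` is hypothetical.

```python
import numpy as np

def rcut_overlay(y1, image, grid_hw):
    """Blend a foreground-masked token map with the input image X.

    y1: length-S vector of per-token scores, assumed to lie in [0, 1].
    image: H x W x 3 array with values in [0, 255].
    grid_hw: (h, w) token grid with h * w == S; H and W are assumed
             to be integer multiples of h and w.
    """
    y1 = np.asarray(y1, dtype=float)
    image = np.asarray(image, dtype=float)
    # Foreground: scores at or above the mean; background is set to 0.
    y_c = np.where(y1 >= y1.mean(), y1, 0.0)
    h, w = grid_hw
    grid = y_c.reshape(h, w)
    big_h, big_w = image.shape[:2]
    # Nearest-neighbour upsampling (an assumption for this sketch).
    up = np.repeat(np.repeat(grid, big_h // h, axis=0), big_w // w, axis=1)
    return 0.5 * 255.0 * up[..., None] + 0.5 * image

# Toy example: 2 x 2 token grid over a black 4 x 4 image.
demo_overlay = rcut_overlay(
    np.array([0.0, 0.2, 0.8, 1.0]), np.zeros((4, 4, 3)), (2, 2)
)
```

In the toy example the mean score is 0.5, so the two top tokens fall into the background set and contribute nothing to the overlay, while the two bottom tokens keep their scores.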

Fig. 5. Explainability visualization results for the LRN dataset. In this result, "car" represents the warning "Watch out for the car ahead right"; "cyclist" represents the warning "Watch out for the cyclist ahead left"; "pedestrian" represents the warning "Watch out for the pedestrian ahead right".

Fig. 6. Ablation test for three method variants. Plots in even rows represent the heatmaps of the corresponding explainability maps.

Fig. 7. Explainability visualization results for the wrongly predicted images. Red rectangles represent the ground truth bounding boxes; green rectangles represent the bounding boxes of the predicted results.
$$\min_{X} Ncut(X) = \min_{y} \frac{y^T (K - e) y}{y^T K y} \quad (11)$$

TABLE VI. MRFP AND LRFP TEST FOR LRN DATASET